Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sun Dec 11 14:23:19 2022"

How are you feeling right now? - I’m feeling a bit overwhelmed but hopeful. What do you expect to learn? - Running data analysis with R. Where did you hear about the course? - The course was recommended in one of Kimmo’s other courses.

How did the R for Health Data Science book and the Exercise Set 1 work as a “crash course” on modern R tools and using RStudio? - This was an interesting addition to the course materials. I like how the book and the Exercise file are synchronized. Which were your favorite topics? - I found the different methods for filtering data useful. Which topics were most difficult? - Joining multiple data sets. Some other comments on the book and our new approach of getting started with R Markdown etc.? - With little background in working with R there seems to be a lot to learn. Luckily there is no need to memorize everything.

Link to GitHub repository: https://github.com/AapoVir/IODS-project.git


Regression and model validation

Describe the work you have done this week and summarize your learning.

date()
## [1] "Sun Dec 11 14:23:19 2022"
#install.packages("GGally")
library(tidyverse) 
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

1. Read the students2014 data into R from your local data folder.

learning2014 <- read_csv("data/learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Explore the structure and the dimensions of the data and describe the dataset briefly

dim(learning2014)
## [1] 166   7
  #The dataset has 7 variables (columns) and 166 observations (rows)
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
  # The dataset has six numeric variables (age, attitude, deep, stra, surf, points) and one character/categorical variable (gender).

2. Show a graphical overview of the data.

ggpairs(learning2014, mapping = aes(col = gender,alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# Show summaries of the variables in the data
summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
type_sum(learning2014)
## [1] "spc_tbl_[,7]"

Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

Women outnumber men in the cohort. The distribution of age is skewed to the right: ages range between 17 and 55 years, with a mean of 25.51 and a median of 22.00 years, so a tail of older students pulls the mean above the median. Attitude and points are positively correlated. Surface and strategic learning are negatively correlated.

3. Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable.

#access library
library(ggplot2)

#Fit a regression model with surf, deep, and age as explanatory variables and points as the outcome variable
lm1 <- lm(points ~ surf+deep+age, data = learning2014)
lm1
## 
## Call:
## lm(formula = points ~ surf + deep + age, data = learning2014)
## 
## Coefficients:
## (Intercept)         surf         deep          age  
##    33.24979     -2.03323     -0.70497     -0.08905
#Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. 
summary(lm1)
## 
## Call:
## lm(formula = points ~ surf + deep + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.0771  -3.2767   0.2549   4.1802  10.6178 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.24979    5.07515   6.551 7.22e-10 ***
## surf        -2.03323    0.91707  -2.217    0.028 *  
## deep        -0.70497    0.86668  -0.813    0.417    
## age         -0.08905    0.05910  -1.507    0.134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.835 on 162 degrees of freedom
## Multiple R-squared:  0.03794,    Adjusted R-squared:  0.02012 
## F-statistic: 2.129 on 3 and 162 DF,  p-value: 0.09855

The output describes a multiple linear regression model. Surface learning (surf) is negatively associated with the outcome variable ‘points’, and the association is statistically significant (p = 0.028). Neither deep learning nor age has a statistically significant association with the outcome variable.

#If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
# Repeating the model without 'deep' and 'age'.
lm2 <- lm(points ~ surf, data = learning2014)
lm2
## 
## Call:
## lm(formula = points ~ surf, data = learning2014)
## 
## Coefficients:
## (Intercept)         surf  
##      27.202       -1.609

4. Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model.

#Taking the summary of the new model.
summary(lm2)
## 
## Call:
## lm(formula = points ~ surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6539  -3.3744   0.3574   4.4734  10.2234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.2017     2.4432  11.134   <2e-16 ***
## surf         -1.6091     0.8613  -1.868   0.0635 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.851 on 164 degrees of freedom
## Multiple R-squared:  0.02084,    Adjusted R-squared:  0.01487 
## F-statistic:  3.49 on 1 and 164 DF,  p-value: 0.06351

Removing ‘deep’ and ‘age’ from the model renders the association of surface learning (surf) with ‘points’ statistically non-significant (p = 0.06351).

R-squared is a measure of model fit for linear models: it tells what proportion of the variance in the outcome variable is explained by the explanatory variables. The multiple R-squared of 0.02084 indicates that ‘surf’ explains only about 2 % of the variance in ‘points’ (adjusted R-squared 0.01487, about 1.5 % after adjusting for the number of predictors).
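As a sanity check on this interpretation: for a simple regression with one explanatory variable, the multiple R-squared equals the squared correlation between the two variables. A minimal sketch using the built-in mtcars data (not the course data):

```r
# For simple linear regression, multiple R-squared equals the squared
# correlation between the explanatory and the outcome variable.
# Demonstrated on the built-in mtcars data (wt vs mpg).
fit <- lm(mpg ~ wt, data = mtcars)
r2_model   <- summary(fit)$r.squared
r2_by_hand <- cor(mtcars$wt, mtcars$mpg)^2
round(c(model = r2_model, by_hand = r2_by_hand), 4)  # the two values agree
```

The same check applied to lm2 would return the 0.02084 reported above.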

5. Produce the following diagnostic plots:

#Residuals vs Fitted values
plot(lm2, which=c(1))

#Normal QQ-plot
plot(lm2, which=c(2))

#Residuals vs Leverage. 
plot(lm2, which=c(5))

Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.

The first diagnostic plot, ‘Residuals vs Fitted’, plots the residuals against the fitted values of the response variable ‘points’. The variability of the residuals is rather constant across the fitted values, indicating that the constant-variance (homoscedasticity) assumption holds. In the normal Q-Q plot of the residuals, the points fall mostly on a straight line, supporting the assumption that the errors are normally distributed. In the ‘Residuals vs Leverage’ plot, the spread of the standardized residuals does not change as a function of leverage and no single observation stands out, meaning no individual data point has an unusually large influence on the fitted model.
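The normality assumption can also be checked numerically rather than visually. A minimal sketch on the built-in mtcars data (for the course model, the same call would be applied to lm2):

```r
# Extract the residuals of a fitted model and run a Shapiro-Wilk test;
# a p-value above 0.05 would give no evidence against normality.
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)
shapiro.test(res)
```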


#install.packages("GGally")
library(tidyverse) 
library(GGally)
library(ggplot2)

#alc<-read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=";", header=TRUE)


# Alternative for downloading from working directory data folder
alc<-read.csv("data/alc.csv")

#Checking variable names
names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

These data describe student achievement in secondary education at two Portuguese schools. Two datasets are provided, one for performance in mathematics (mat) and one for the Portuguese language (por). The data were acquired using school reports and questionnaires.

The aim of this analysis is to find associations between levels of alcohol consumption and other variables in the data set. Here we focus on the relationship of high and low alcohol consumption with age, travel time between home and school, weekly study time and number of past class failures. We hypothesize that age and past class failures are positively correlated with high alcohol consumption while study time might be negatively correlated. Travel time should not have any impact on alcohol consumption.

#Exploring the distributions of the chosen variables and their relationships with alcohol consumption: age

#Age vs high alcohol use, create cross-table for high and low alcohol use based on age
alc %>%
  group_by(high_use, age) %>%
  tally() %>%
  spread(high_use, n)
## # A tibble: 7 × 3
##     age `FALSE` `TRUE`
##   <int>   <int>  <int>
## 1    15      63     18
## 2    16      74     28
## 3    17      62     35
## 4    18      52     25
## 5    19       7      4
## 6    20       1     NA
## 7    22      NA      1
#create linear regression model
lm_age <-lm(high_use~age,data = alc)
summary(lm_age)
## 
## Call:
## lm(formula = high_use ~ age, data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4405 -0.3174 -0.2764  0.6416  0.7646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.37994    0.33479  -1.135   0.2572  
## age          0.04102    0.02015   2.036   0.0425 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4569 on 368 degrees of freedom
## Multiple R-squared:  0.01114,    Adjusted R-squared:  0.008452 
## F-statistic: 4.145 on 1 and 368 DF,  p-value: 0.04246
#Create box-plot
g1 <- ggplot(alc, aes(x = high_use, y = age))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("age")

Based on the cross table, students were 15-22 years old and most students report low alcohol consumption. Low alcohol use is more common among the younger students. According to the linear regression model, age is statistically significantly associated with high alcohol consumption (p = 0.0425). The box plot illustrates that students reporting high alcohol use tend to be older.
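Since high_use is binary, logistic regression (glm with a binomial family) would be the conventional model; the linear model above treats TRUE/FALSE as 0/1. A minimal sketch with simulated data standing in for alc (the variable names mirror the real data, but the numbers are made up):

```r
# Logistic regression of a binary outcome on age, using simulated data.
set.seed(1)
sim <- data.frame(age = sample(15:22, 370, replace = TRUE))
sim$high_use <- rbinom(370, 1, plogis(-4 + 0.2 * sim$age))
glm_age <- glm(high_use ~ age, data = sim, family = binomial)
coef(glm_age)          # change in log-odds per year of age
exp(coef(glm_age)[2])  # odds ratio for one extra year of age
```

On the real data the analogous call would be glm(high_use ~ age, data = alc, family = binomial).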

#Exploring the distributions of the chosen variables and their relationships with alcohol consumption: traveltime

#travel time vs high alcohol use, create cross-table for high and low alcohol use based on traveltime
alc %>%
  group_by(high_use, traveltime) %>%
  tally() %>%
  spread(high_use, n)
## # A tibble: 4 × 3
##   traveltime `FALSE` `TRUE`
##        <int>   <int>  <int>
## 1          1     174     68
## 2          2      73     26
## 3          3      10     11
## 4          4       2      6
#create linear regression model
lm_traveltime <-lm(high_use~traveltime,data = alc)
summary(lm_traveltime)
## 
## Call:
## lm(formula = high_use ~ traveltime, data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5323 -0.2594 -0.2594  0.6496  0.7406 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.16849    0.05428   3.104  0.00205 **
## traveltime   0.09095    0.03378   2.692  0.00742 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.455 on 368 degrees of freedom
## Multiple R-squared:  0.01931,    Adjusted R-squared:  0.01665 
## F-statistic: 7.247 on 1 and 368 DF,  p-value: 0.007425
#Create box-plot
g2 <- ggplot(alc, aes(x = high_use, y = traveltime))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("traveltime")

Travel time is coded with numeric values from 1 to 4 as follows: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, 4 - >1 hour. Most students have a commute of less than 15 minutes. According to the linear regression model, travel time is positively associated with high alcohol consumption and the association is statistically significant (p = 0.0074), contrary to our hypothesis.

#Exploring the distributions of the chosen variables and their relationships with alcohol consumption: studytime

#study time vs high alcohol use, create cross-table for high and low alcohol use based on study time
alc %>%
  group_by(high_use, studytime) %>%
  tally() %>%
  spread(high_use, n)
## # A tibble: 4 × 3
##   studytime `FALSE` `TRUE`
##       <int>   <int>  <int>
## 1         1      56     42
## 2         2     128     57
## 3         3      52      8
## 4         4      23      4
#create linear regression model
lm_studytime <-lm(high_use~studytime,data = alc)
summary(lm_studytime)
## 
## Call:
## lm(formula = high_use ~ studytime, data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4211 -0.3050 -0.1889  0.5789  0.9272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.53720    0.06094   8.815  < 2e-16 ***
## studytime   -0.11609    0.02755  -4.213 3.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4488 on 368 degrees of freedom
## Multiple R-squared:  0.04602,    Adjusted R-squared:  0.04343 
## F-statistic: 17.75 on 1 and 368 DF,  p-value: 3.17e-05
#Create box-plot
g3 <- ggplot(alc, aes(x = high_use, y = studytime))

# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("study time")

Weekly study time is coded with numeric values from 1 to 4 as follows: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, 4 - >10 hours. Most students study 2 to 5 hours per week. According to the linear regression model, study time is negatively associated with high alcohol consumption and the association is statistically significant (p < 0.001).

#Exploring the distributions of the chosen variables and their relationships with alcohol consumption: failures

#failures vs high alcohol use, create cross-table for high and low alcohol use based on past class failures
alc %>%
  group_by(high_use, failures) %>%
  tally() %>%
  spread(high_use, n)
## # A tibble: 4 × 3
##   failures `FALSE` `TRUE`
##      <int>   <int>  <int>
## 1        0     238     87
## 2        1      12     12
## 3        2       8      9
## 4        3       1      3
#create linear regression model
lm_failures <-lm(high_use~failures,data = alc)
summary(lm_failures)
## 
## Call:
## lm(formula = high_use ~ failures, data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7409 -0.2703 -0.2703  0.5728  0.7297 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.27033    0.02477  10.915  < 2e-16 ***
## failures     0.15685    0.04211   3.725 0.000226 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4511 on 368 degrees of freedom
## Multiple R-squared:  0.03634,    Adjusted R-squared:  0.03372 
## F-statistic: 13.88 on 1 and 368 DF,  p-value: 0.0002259
#Create box-plot
g4 <- ggplot(alc, aes(x = high_use, y = failures))

# define the plot as a boxplot and draw it
g4 + geom_boxplot() + ylab("failures")

Most students have not failed a class. According to the linear regression model, past class failures are positively associated with high alcohol consumption and the association is statistically significant (p < 0.001). The box plot is not very informative here, because ‘failures’ takes only a few discrete values and the vast majority of observations are zero.
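Because ‘failures’ is discrete and heavily skewed towards zero, a proportion table is a more readable alternative to the boxplot. A minimal sketch that rebuilds the cross-table counts shown above and computes the share of high users per failure count:

```r
# Rebuild the cross-table counts from above and compute, for each
# number of failures, the proportion of students with high alcohol use.
d <- data.frame(
  failures = rep(c(0, 0, 1, 1, 2, 2, 3, 3), c(238, 87, 12, 12, 8, 9, 1, 3)),
  high_use = rep(c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
                 c(238, 87, 12, 12, 8, 9, 1, 3))
)
round(prop.table(table(d$failures, d$high_use), margin = 1), 2)
# the share of high users rises from about 27 % (0 failures) to 75 % (3 failures)
```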


#Load necessary packages
library(tidyverse) 
library(GGally)
library(ggplot2)
#install.packages(c("MASS", "corrplot"))

Clustering and classification

The Boston Data

This week’s assignment utilizes data on housing values in the suburbs of Boston to explore the association of the explanatory variables with the crime rates in different suburbs. The Boston data set has 506 rows of observations, each representing a different suburb. The data set has 14 columns containing the following variables:

crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per 10,000 USD
ptratio: pupil-teacher ratio by town
black: proportion of blacks by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in 1000s of USD

#Download the Boston data
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Taking an overview of the variables

The key statistical properties of the variables are presented in the summary table above. The per capita crime rates vary between 0.00632 and 88.97620, with a mean of 3.61352 and a median of 0.25651. The low median relative to the mean suggests that a few suburbs with very high crime rates pull the overall average up. The proportion of non-retail business acres (indus) ranges between 0.46 % and 27.74 %, with a mean of 11.14 % and a median of 9.69 %. On average 68.57 % (median 77.50 %) of the owner-occupied houses were built before 1940. The median value of owner-occupied homes ranges between 5.00 and 50.00 (in thousands of USD), with a mean of 22.53 and a median of 21.20.

The correlation matrix below shows the associations of the variables in the data set. The size and intensity of the color of the dot denotes the strength of the association, red color indicating negative and blue positive correlation. Looking at the crime column we can see that the variables indus, nox, age, rad, tax, ptratio, and lstat are associated with higher crime rates. Interpreting these data we can postulate that areas with higher industrial land use, closer access to major highways and accordingly higher nitrogen oxides concentration in the air are more prone to higher criminality. Variables that were associated with lower crime rates are zn, rm, dis, black, medv. Put together the data suggest that areas zoned for houses on larger land slots, higher number of rooms per dwelling, longer distance to major employment centers, higher proportion of blacks in the town and more expensive housing had lower crime rates.

The numeric correlation matrix shows that rad (index of accessibility to radial highways) has the strongest positive association with crime rates and the median home value (medv) the strongest negative association. Other inferences can be read from the table as well, such as the positive correlation between indus and rad (0.60), suggesting that industrial areas have better access to radial highways, which makes sense logistically.

library(tidyr); library(dplyr); library(ggplot2)
library(GGally)

#Examining the correlation of the variables
cor_matrix <- cor(Boston) 
round(cor_matrix, digits=2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")

#Plotting the distributions of the variables
ggpairs(Boston[1:7], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

#Histograms for plotting the distribution of the variables
par(mfrow=c(3,3)) #arrange the histograms in one 3x3 panel
hist(Boston$crim, main = paste("crim"), xlab = NULL)
hist(Boston$zn, main = paste("zn"), xlab = NULL)
hist(Boston$indus, main = paste("indus"), xlab = NULL)
hist(Boston$chas, main = paste("chas"), xlab = NULL)
hist(Boston$nox, main = paste("nox"), xlab = NULL)
hist(Boston$rm, main = paste("rm"), xlab = NULL)
hist(Boston$age, main = paste("age"), xlab = NULL)
hist(Boston$dis, main = paste("dis"), xlab = NULL)

Standardizing the dataset

Standardizing the data changed the ranges of the column values, which are now much smaller: each column has a mean of 0 and a standard deviation of 1. Next we create a categorical crime-rate variable using the quantiles as break points and replace the old crime variable with the new categorical one. The data set is then divided into a training set and a test set.
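As a quick check of what scale() does: each column is centered by its mean and divided by its standard deviation. A minimal sketch on the built-in mtcars data (not the Boston data):

```r
# scale() subtracts the column mean and divides by the column standard
# deviation, so the result matches the manual computation.
z       <- as.vector(scale(mtcars$mpg))
by_hand <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)
all.equal(z, by_hand)  # TRUE
```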

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# Save as data frame
boston_scaled<-as.data.frame(boston_scaled)

summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
#Setting quantiles as breaks for categoricals crime variable
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

#Creating training and test sets
library(dplyr)
library(MASS)
boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt",
                            sep=",", header = T)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

#Fitting the linear discriminant model

lda.fit <- lda(crime~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2400990 0.2400990 0.2673267 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.92980714 -0.9258533 -0.07933396 -0.8988515  0.4699139 -0.8800690
## med_low  -0.06532032 -0.3100251  0.01179157 -0.5768188 -0.1052403 -0.3849437
## med_high -0.38822494  0.2422142  0.17414622  0.3976446  0.1591027  0.4564593
## high     -0.48724019  1.0169921 -0.05360128  1.0397246 -0.3819122  0.7915953
##                 dis        rad        tax    ptratio       black        lstat
## low       0.8842760 -0.7082658 -0.7130674 -0.5065763  0.38009341 -0.764375911
## med_low   0.3456069 -0.5473481 -0.4455566 -0.0508894  0.34017665 -0.188797965
## med_high -0.3775725 -0.3993498 -0.2742829 -0.2856516  0.07052586 -0.007806074
## high     -0.8340967  1.6393984  1.5149640  0.7822555 -0.70836257  0.892334358
##                 medv
## low       0.53869659
## med_low   0.02613753
## med_high  0.19248303
## high     -0.69329702
## 
## Coefficients of linear discriminants:
##                  LD1          LD2        LD3
## zn       0.115560002  0.549153405 -0.9552562
## indus    0.007920383 -0.426190462  0.2007054
## chas    -0.107404810  0.011727132  0.1645388
## nox      0.475231432 -0.853000399 -1.1939660
## rm      -0.097149878 -0.155037427 -0.2261259
## age      0.279960711 -0.397045777 -0.3288585
## dis     -0.032528220 -0.297542279 -0.0796108
## rad      3.233128821  0.787526851 -0.2526551
## tax     -0.034937381  0.253683184  0.5652062
## ptratio  0.130675659  0.007615114 -0.1576227
## black   -0.127959932  0.020660774  0.1234552
## lstat    0.195298309 -0.091849318  0.4259614
## medv     0.212021308 -0.366378965 -0.1174221
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9532 0.0363 0.0105
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric

classes <- as.numeric(factor(train$crime))

# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

Linear discriminant analysis

The linear discriminant plot above shows the cluster of towns with a high crime rate on the right side of the graph, and the other crime-rate categories (low, med_low, med_high) on the left. Arrows representing the effect of each explanatory variable are drawn on the graph, with rad having by far the longest arrow, pointing towards the high-crime cluster. This means that, for any given suburb, easier access to the radial highways of greater Boston is associated with a higher crime rate.
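A natural next step is to use the fitted model to predict the crime classes of the held-out test observations and cross-tabulate the predictions against the correct classes. The sketch below is self-contained: it recreates the scaled data and a fresh 80/20 train/test split, so the seed (chosen here for illustration) and the exact counts will differ from the split above.

```r
# Sketch: evaluating the LDA model on the held-out test set.
# Recreates the scaled Boston data and a fresh 80/20 split (seed chosen
# here for illustration), then cross-tabulates predictions vs. truth.
library(MASS)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# categorize the crime rate into quantile-based classes, as above
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- boston_scaled[, names(boston_scaled) != "crim"]

# 80/20 train/test split
set.seed(42)
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
correct_classes <- test$crime
test <- test[, names(test) != "crime"]

# fit on the training set, predict on the test set
lda.fit <- lda(crime ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_mat <- table(correct = correct_classes, predicted = lda.pred$class)
conf_mat
```

The diagonal of the cross-tabulation counts the correctly classified towns; dividing its sum by the number of test observations gives the overall accuracy.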

#Reloading the Boston data and standardizing it
library(MASS)
data("Boston")
boston_scaled2<-as.data.frame(scale(Boston))

#Calculate distances
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#Calculating WCSS to determine optimal number of clusters
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

#K-means

km <- kmeans(boston_scaled2, centers = 4)
pairs(boston_scaled2[, 1:5], col = km$cluster)

#Repeating with two clusters
km <- kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2[, 1:5], col = km$cluster)

K-means clustering

Based on the calculations of the total within-cluster sum of squares (WCSS), the optimal number of clusters is two. This is the point at which the total WCSS drops most sharply and the curve bends the most. For comparison, the k-means clustering is drawn with both four and two clusters. For readability, only the first five variables are included in these pairwise cluster plots.
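The elbow rule described above can also be checked numerically: the optimal k is the one after which adding another cluster stops reducing the total WCSS by much. A minimal sketch of this idea (my own illustration, using the built-in iris measurements as a stand-in data set; the same lines apply to `boston_scaled2`):

```r
# Sketch: locating the WCSS "elbow" as the k with the largest single
# improvement over k - 1 (illustrated on the built-in iris measurements).
set.seed(123)
x <- scale(as.matrix(iris[, 1:4]))
twcss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
drops <- -diff(twcss)    # WCSS reduction gained by each extra cluster
which.max(drops) + 1     # the k giving the largest single reduction
```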

# Drawing a scatter plot to see if older houses are less valuable than new ones
library(tidyverse)
clusters <- factor(km$cluster)
boston_scaled2 %>% ggplot(aes(x = age, y = medv, col = clusters)) +
  geom_point()

The scatter plot shows an example of clustering the observations based on the proportion of owner-occupied units built prior to 1940 (age) and the median value of owner-occupied homes in USD 1000s (medv). The red cluster 1 in the lower right corner contains the cheaper, older housing properties. The green cluster 2 is more evenly distributed.
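One way to make this cluster interpretation more concrete is to inspect the (scaled) cluster centroids: a positive value means the cluster sits above the overall average for that variable. A sketch, refitting k = 2 on the standardized Boston data (the seed is my own choice, so the cluster numbering may be flipped relative to the plot above):

```r
# Sketch: characterizing the two k-means clusters by their centroids
# in the standardized Boston data (seed chosen here for illustration).
library(MASS)
data("Boston")
bs <- as.data.frame(scale(Boston))
set.seed(123)
km2 <- kmeans(bs, centers = 2)
# centroid values for age and medv: one cluster should be older and
# cheaper than the other, matching the scatter plot interpretation
round(km2$centers[, c("age", "medv")], 2)
```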

Thank you for reading my report for this week's assignment. Have an awesome day!


library(readr)
human<-read_csv("data/human.csv")
## Rows: 155 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): Edu2.FM, Labo.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(human)
## [1] "Edu2.FM"   "Labo.FM"   "Edu.Exp"   "Life.Exp"  "GNI"       "Mat.Mor"  
## [7] "Ado.Birth" "Parli.F"
str(human)
## spc_tbl_ [155 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Edu2.FM  : num [1:155] 1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num [1:155] 0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num [1:155] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num [1:155] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : num [1:155] 64992 42261 56431 44025 45435 ...
##  $ Mat.Mor  : num [1:155] 4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num [1:155] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num [1:155] 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Edu2.FM = col_double(),
##   ..   Labo.FM = col_double(),
##   ..   Edu.Exp = col_double(),
##   ..   Life.Exp = col_double(),
##   ..   GNI = col_double(),
##   ..   Mat.Mor = col_double(),
##   ..   Ado.Birth = col_double(),
##   ..   Parli.F = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Dimensionality reduction techniques

Overview of the Human data

The Human data is a composite of two joined data sets: the Human Development Index (HDI) data and Gender Inequality Index (GII) data. The HDI and GII data originate from the United Nations Development Programme (UNDP).

Human Development Index (HDI)

The UNDP created the HDI to emphasize that people and their capabilities, not economic growth alone, should be the ultimate criteria for assessing the development of a country. The HDI takes into account a nation’s longevity, education and income. The health dimension is assessed by life expectancy at birth; the education dimension is measured by the mean years of schooling for adults aged 25 years and over and the expected years of schooling for children of school-entering age; and the standard of living dimension is measured by gross national income per capita.
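As a concrete illustration of how the three dimensions combine: per the UNDP’s post-2010 methodology, the HDI is the geometric mean of the three normalized dimension indices. The example values below are made up.

```r
# Sketch: the HDI as the geometric mean of the three dimension indices
# (health, education, income), each normalized to the 0-1 range.
hdi <- function(health, education, income) {
  (health * education * income)^(1/3)
}

# made-up example values for illustration
hdi(health = 0.9, education = 0.8, income = 0.85)
```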

Gender Inequality Index (GII)

The GII is a composite metric of gender inequality based on three dimensions: reproductive health, empowerment and the labour market. The indicators of the reproductive health dimension are the maternal mortality ratio and the adolescent birth rate. The indicators of the empowerment dimension are the proportions of women and men with at least secondary education and their shares of parliamentary seats. The indicator of the labour market dimension is the labour force participation rate for women and men.

Variable descriptions

  • There are 155 observations in the data.
  • The Human data set contains the following variables:
  • Edu2.FM = The ratio of women to men with at least secondary education
  • Labo.FM = The ratio of women to men participating in the labour force
  • Edu.Exp = Expected years of schooling for children of school-entering age
  • Life.Exp = Life expectancy at birth
  • GNI = Gross national income per capita
  • Mat.Mor = Maternal mortality ratio, i.e. the death of a woman during pregnancy, at delivery, or soon after delivery
  • Ado.Birth = Adolescent birth rate
  • Parli.F = Share of women holding parliamentary seats

Variable characteristics and distributions

  • The ratio of women to men with at least secondary education (Edu2.FM) ranges from 0.1717 to 1.4967, with mean 0.8529 and median 0.9375. Globally, more men than women hold at least a secondary education.
  • The ratio of women to men participating in the labour force (Labo.FM) ranges from 0.1857 to 1.0380, with mean 0.7074 and median 0.7535. Globally, more men participate in the labour force than women.
  • Expected years of schooling (Edu.Exp) range from 5.40 to 20.20 years, with mean 13.18 and median 13.5. The mean and median are almost the same and close to the middle of the range, suggesting that the values are approximately normally distributed, as shown in the histogram below.
  • Gross national income per capita (GNI) ranges from 581 to 123124 USD, with mean 17628 and median 12040. The histogram below shows that GNI is heavily right-skewed, with most countries in the lowest earning bracket and just a few rich countries.
  • Maternal mortality (Mat.Mor) ranges from 1 to 1100, with mean 149.1 and median 49.0.
  • The adolescent birth rate (Ado.Birth) ranges from 0.60 to 204.80, with mean 47.16 and median 33.60.
  • Women’s share of parliamentary seats (Parli.F) ranges from 0 to 57.50 %, with mean 20.91 and median 19.30.
#Histograms
par(mfrow=c(3,3))
hist(human$Edu2.FM, col = 1, main = "Women's education ratio", xlab = "ratio of women to men with at least sec education", ylab = "n countries")
hist(human$Labo.FM, col = 2, main = "Women's labour force ratio", xlab = "ratio of women to men in the labour force", ylab = "n countries")
hist(human$Edu.Exp, col = 3, main = "Expected years of schooling", xlab = "years", ylab = "n countries")
hist(human$Life.Exp, col = 4, main = "Life expectancy at birth", xlab = "years", ylab = "n countries")
hist(human$GNI, col = 5, main = "GNI per capita", xlab = "US dollars", ylab = "n countries")
hist(human$Mat.Mor, col = 6, main = "Maternal mortality", xlab = "n of maternal deaths per 100,000 live births", ylab = "n countries")
hist(human$Ado.Birth, col = 7, main = "Adolescent birth rate", xlab = "births per 1,000 women aged 15-19", ylab = "n countries")
hist(human$Parli.F, col = 8, main = "Women in parliament", xlab = "share of parliamentary seats (%)", ylab = "n countries")

library(tidyr); library(dplyr); library(ggplot2)
library(ggplot2)
library(GGally)
library(corrplot)
library(readr)

#Download data
human<-read_csv("data/human.csv")
## Rows: 155 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): Edu2.FM, Labo.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Take a summary of the data
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
#A matrix of scatterplots
pairs(human[-1])

cor_matrix <- cor(human[-1])
round(cor_matrix, digits=2)
##           Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Labo.FM      1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp    -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI         -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor      0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.25    0.21     0.17  0.09   -0.09     -0.07    1.00
corrplot(cor_matrix,method="square", type="lower",cl.pos="b", tl.pos="lt",tl.cex=0.8)

Variable correlations

  • The correlation matrix shows that maternal mortality is negatively correlated with expected years of education (-0.74), life expectancy (-0.86) and GNI (-0.50).
  • Similarly, the adolescent birth rate is negatively correlated with expected years of education (-0.70), life expectancy (-0.73) and GNI (-0.56).
  • There is a clear positive correlation between GNI and expected years of schooling (0.62) and between GNI and life expectancy (0.63).

Principal components analysis

Principal component analysis (PCA) is a technique for analyzing large datasets with a high number of dimensions per observation. PCA can increase the interpretability of the data while preserving the maximum amount of information, and it enables the visualization of multidimensional data.

First we attempt a PCA on the Human data without scaling it first. This clusters the countries in the upper right corner with a lot of overlap, so we cannot discern how the dimensions affect our observations. Only countries representing some extremes in the data stand apart from the main cluster (e.g. Qatar, Kuwait and Sierra Leone).

library(readr)
human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", sep =",", header = TRUE)
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex=c(0.5,0.8), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

After standardizing the data we can see that the countries are more spread out along the axes of the two principal components. The arrows represent the variables of the data. After standardization, the two principal components capture the variation in the data much better. Countries with high standards of living, long life expectancy and higher education cluster to the left, while countries with lower standards of living, shorter life expectancy and lower education sit to the right of the graph. The wealthy countries are further divided along the vertical axis based on gender equality, with many Nordic and Western countries in the upper left quadrant representing countries with better gender equality. Wealthy countries with poorer gender equality, e.g. the Gulf countries, cluster towards the lower left quadrant. Afghanistan and Niger, in the lower right corner of the PCA plot, represent countries with low income and high gender inequality.

In a nutshell, we can view PC1 as the ‘wealth and health’ axis and PC2 as the ‘gender equality’ axis.

# standardize the variables
human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", sep =",", header = TRUE)
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex=c(0.5,0.8), col = c("grey40", "deeppink2"))
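A common refinement of the biplot above is to label the axes with the percentage of variance each component captures, taken from `summary()` of the `prcomp` object. A sketch of this, shown self-contained on the built-in USArrests data as a stand-in; the same lines apply to `pca_human`:

```r
# Sketch: labelling biplot axes with the % of variance each PC captures
# (USArrests used as a stand-in data set for a self-contained example).
pca <- prcomp(scale(USArrests))
pct <- round(100 * summary(pca)$importance["Proportion of Variance", ], digits = 1)
pc_lab <- paste0(colnames(pca$rotation), " (", pct, "%)")
biplot(pca, cex = c(0.5, 0.8), xlab = pc_lab[1], ylab = pc_lab[2])
```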

library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
tea <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", 
                  sep = ",", header = T)
tea$Tea <- factor(tea$Tea)
tea$How <- factor(tea$How)
tea$how <- factor(tea$how)
tea$sugar <- factor(tea$sugar)
tea$where <- factor(tea$where)
tea$lunch <- factor(tea$lunch)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : chr  "breakfast" "breakfast" "Not.breakfast" "Not.breakfast" ...
##  $ tea.time        : chr  "Not.tea time" "Not.tea time" "tea time" "Not.tea time" ...
##  $ evening         : chr  "Not.evening" "Not.evening" "evening" "Not.evening" ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : chr  "Not.dinner" "Not.dinner" "dinner" "dinner" ...
##  $ always          : chr  "Not.always" "Not.always" "Not.always" "Not.always" ...
##  $ home            : chr  "home" "home" "home" "home" ...
##  $ work            : chr  "Not.work" "Not.work" "work" "Not.work" ...
##  $ tearoom         : chr  "Not.tearoom" "Not.tearoom" "Not.tearoom" "Not.tearoom" ...
##  $ friends         : chr  "Not.friends" "Not.friends" "friends" "Not.friends" ...
##  $ resto           : chr  "Not.resto" "Not.resto" "resto" "Not.resto" ...
##  $ pub             : chr  "Not.pub" "Not.pub" "Not.pub" "Not.pub" ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : chr  "p_unknown" "p_variable" "p_variable" "p_variable" ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : chr  "M" "F" "F" "M" ...
##  $ SPC             : chr  "middle" "middle" "other worker" "student" ...
##  $ Sport           : chr  "sportsman" "sportsman" "sportsman" "Not.sportsman" ...
##  $ age_Q           : chr  "35-44" "45-59" "45-59" "15-24" ...
##  $ frequency       : chr  "1/day" "1/day" "+2/day" "1/day" ...
##  $ escape.exoticism: chr  "Not.escape-exoticism" "escape-exoticism" "Not.escape-exoticism" "escape-exoticism" ...
##  $ spirituality    : chr  "Not.spirituality" "Not.spirituality" "Not.spirituality" "spirituality" ...
##  $ healthy         : chr  "healthy" "healthy" "healthy" "healthy" ...
##  $ diuretic        : chr  "Not.diuretic" "diuretic" "diuretic" "Not.diuretic" ...
##  $ friendliness    : chr  "Not.friendliness" "Not.friendliness" "friendliness" "Not.friendliness" ...
##  $ iron.absorption : chr  "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" ...
##  $ feminine        : chr  "Not.feminine" "Not.feminine" "Not.feminine" "Not.feminine" ...
##  $ sophisticated   : chr  "Not.sophisticated" "Not.sophisticated" "Not.sophisticated" "sophisticated" ...
##  $ slimming        : chr  "No.slimming" "No.slimming" "No.slimming" "No.slimming" ...
##  $ exciting        : chr  "No.exciting" "exciting" "No.exciting" "No.exciting" ...
##  $ relaxing        : chr  "No.relaxing" "No.relaxing" "relaxing" "relaxing" ...
##  $ effect.on.health: chr  "No.effect on health" "No.effect on health" "No.effect on health" "No.effect on health" ...

Overview of the Tea data

  • The Tea dataset contains the answers of a questionnaire on tea consumption.
  • There are 300 observations and 36 variables in the dataset.

For visualizing the data we choose six interesting variables (how, How, pub, sugar, Tea and where):

  • Most tea is consumed from tea bags as opposed to loose tea.
  • Most people drink tea ‘alone’, that is, without lemon, milk or other additional flavoring.
  • Only a small fraction drink their tea at a pub.
  • The answers were almost evenly split between taking tea with and without sugar.
  • Earl Grey appears to be the most popular tea variety in this poll.
  • Most tea was bought from a chain store while only a small proportion of the respondents purchase tea from a tea shop.
library(ggplot2)
pivot_longer(tea, cols = 12:17) %>% 
  ggplot(aes(value)) + geom_bar() + facet_wrap("name", scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Multiple correspondence analysis

Multiple correspondence analysis (MCA) is a method for analyzing qualitative data, used to detect patterns or structure in the data as well as for dimension reduction. The axes of the MCA factor map represent the extracted dimensions; the first dimension (Dim1) is the most important in terms of the amount of variance accounted for.

In the Tea data we can observe the most common characteristics clustered around the center of the graph (Dim1 = 0, Dim2 = 0). The typical tea-drinking habit involves Earl Grey bought from a chain store, drunk as is or with milk. The more distinguishing features of tea consumption are spread out from the center of the MCA factor map. Looking at the lower right corner we can identify a cluster of categories depicting a true tea devotee who buys loose-leaf green tea from a designated tea shop.

library(dplyr)
library(tidyr)
library(FactoMineR)
tea_time<-tea[12:17]
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.281   0.267   0.222   0.186   0.179   0.154   0.144
## % of var.             15.325  14.547  12.130  10.166   9.773   8.390   7.844
## Cumulative % of var.  15.325  29.872  42.002  52.168  61.941  70.331  78.176
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.136   0.115   0.087   0.062
## % of var.              7.422   6.259   4.751   3.391
## Cumulative % of var.  85.598  91.857  96.609 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
## 1         | -0.382  0.173  0.140 | -0.266  0.088  0.068 |  0.223  0.074  0.048
## 2         | -0.330  0.129  0.070 | -0.202  0.051  0.026 |  0.758  0.862  0.369
## 3         | -0.435  0.224  0.313 | -0.210  0.055  0.073 |  0.108  0.018  0.019
## 4         | -0.578  0.396  0.534 | -0.159  0.032  0.040 | -0.314  0.148  0.157
## 5         | -0.435  0.224  0.313 | -0.210  0.055  0.073 |  0.108  0.018  0.019
## 6         | -0.435  0.224  0.313 | -0.210  0.055  0.073 |  0.108  0.018  0.019
## 7         | -0.435  0.224  0.313 | -0.210  0.055  0.073 |  0.108  0.018  0.019
## 8         | -0.330  0.129  0.070 | -0.202  0.051  0.026 |  0.758  0.862  0.369
## 9         |  0.256  0.078  0.037 |  0.683  0.584  0.265 |  0.362  0.197  0.075
## 10        |  0.544  0.351  0.181 |  0.461  0.266  0.130 |  0.785  0.923  0.376
##            
## 1         |
## 2         |
## 3         |
## 4         |
## 5         |
## 6         |
## 7         |
## 8         |
## 9         |
## 10        |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test  
## Not.pub   |  -0.102   0.486   0.039  -3.414 |  -0.191   1.801   0.137  -6.406 |
## pub       |   0.383   1.827   0.039   3.414 |   0.719   6.776   0.137   6.406 |
## black     |   0.426   2.658   0.059   4.217 |  -0.093   0.134   0.003  -0.922 |
## Earl Grey |  -0.198   1.501   0.071  -4.605 |   0.238   2.270   0.102   5.519 |
## green     |   0.204   0.272   0.005   1.240 |  -1.181   9.585   0.172  -7.178 |
## alone     |  -0.059   0.136   0.007  -1.397 |  -0.218   1.923   0.088  -5.127 |
## lemon     |   0.841   4.610   0.087   5.110 |   0.518   1.843   0.033   3.147 |
## milk      |  -0.351   1.531   0.033  -3.126 |   0.139   0.253   0.005   1.239 |
## other     |   0.657   0.768   0.013   1.997 |   1.843   6.367   0.105   5.604 |
## No.sugar  |   0.221   1.491   0.052   3.943 |  -0.076   0.189   0.006  -1.366 |
##             Dim.3     ctr    cos2  v.test  
## Not.pub     0.181   1.932   0.123   6.059 |
## pub        -0.680   7.269   0.123  -6.059 |
## black       1.055  20.586   0.365  10.442 |
## Earl Grey  -0.462  10.303   0.385 -10.735 |
## green       0.337   0.937   0.014   2.050 |
## alone       0.021   0.021   0.001   0.484 |
## lemon      -1.325  14.468   0.217  -8.053 |
## milk        0.342   1.845   0.031   3.053 |
## other       2.015   9.131   0.126   6.128 |
## No.sugar    0.577  12.894   0.356  10.317 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## pub       | 0.039 0.137 0.123 |
## Tea       | 0.075 0.192 0.425 |
## How       | 0.119 0.166 0.340 |
## sugar     | 0.052 0.006 0.356 |
## how       | 0.686 0.445 0.040 |
## where     | 0.716 0.653 0.051 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

Countless cups of tea were consumed during the completion of this assignment


Analysis of longitudinal data

This week we delve into techniques for analyzing longitudinal data, where observations are collected on several occasions over a period of time. First we look into ways to visualize such data.

Graphical Displays of Longitudinal Data

In this part of the assignment we use data measuring weight gain in rats on different diets over a 9-week period. The rats were divided into three groups.

  • There were eight rats in the 1st group and four rats each in the 2nd and 3rd groups, 16 rats altogether.
  • At baseline the rats weighed between 225 g and 555 g, mean 365.9 and median 340.0.
  • At the end of the study the rats weighed between 245 g and 628 g, mean 404.1 and median 378.0.

Let’s visualize the weight gain responses of the three groups of our rodents.

# Get the RATS data and convert it into long form (RATSL)
library(dplyr); library(tidyr); library(ggplot2)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- pivot_longer(RATS, cols=-c(ID,Group), names_to = "WD",values_to = "Weight")  %>%  mutate(Time = as.integer(substr(WD,3,4))) %>% arrange(Time)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

#Looking at the data
glimpse(RATS)
## Rows: 16
## Columns: 13
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1   <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470,…
## $ WD8   <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 465,…
## $ WD15  <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 475,…
## $ WD22  <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 485,…
## $ WD29  <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 487,…
## $ WD36  <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 493,…
## $ WD43  <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 493,…
## $ WD44  <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 504,…
## $ WD50  <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 507,…
## $ WD57  <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 518,…
## $ WD64  <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 525,…
summary(RATS)
##        ID     Group      WD1             WD8             WD15      
##  1      : 1   1:8   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  2      : 1   2:4   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  3      : 1   3:4   Median :340.0   Median :345.0   Median :347.5  
##  4      : 1         Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  5      : 1         3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  6      : 1         Max.   :555.0   Max.   :560.0   Max.   :565.0  
##  (Other):10                                                        
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##                                                                 
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0  
## 
names(RATS)
##  [1] "ID"    "Group" "WD1"   "WD8"   "WD15"  "WD22"  "WD29"  "WD36"  "WD43" 
## [10] "WD44"  "WD50"  "WD57"  "WD64"
# Individual response profiles by group for the RATS data.
names(RATSL)
## [1] "ID"     "Group"  "WD"     "Weight" "Time"
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() + 
  scale_linetype_manual(values = rep(1:16, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

The graph shows that some weight gain was observed in all groups of rats. The lines representing the weights of individual rats are closer together in group 1, while the weights in groups 2 and 3 have a wider range. Due to the between-group differences the scale is wide, making smaller changes harder to discern from this visualization.

Next let’s see how standardizing the Rats data changes the response profile of the rats’ weights.

# Same plots after standardizing the data
RATSL <- RATSL %>%
  group_by(Group) %>%
  mutate(stdWeights = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()
#summary(RATSL)
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = stdWeights, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:16, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

Now it is easier to see how the rats’ weights developed over time. A tracking phenomenon — the tendency of rats with higher weight at baseline to have higher weight throughout the study — is more discernible in the standardized plot.

library(dplyr)
library(tidyr)
str(RATSL)
## tibble [176 × 6] (S3: tbl_df/tbl/data.frame)
##  $ ID        : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group     : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD        : chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight    : int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
##  $ Time      : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stdWeights: num [1:176] -1.733 -2.829 -1.367 -0.271 -0.637 ...
#Summary data with group means and standard errors of the mean
RATSS <- RATSL %>%
  group_by(Group,Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(4) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se    <dbl> 7.610789, 6.546537, 5.737953, 6.800407, 5.528740, 5.891883, 5.47…
# Plot the mean profiles
#x = Time, y = stdWeights, linetype = ID
library(ggplot2)
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), linetype = 1, width = 0.3) +
  #theme(legend.position = c(0.8, 0.8)) +
  scale_y_continuous(name = "mean")

The plot shows a gradual weight gain in all groups over the course of the 64-day follow-up.

library(dplyr)
library(tidyr)
# Create summary data by group and subject with the mean weight as the summary variable
RATS_Sum <- RATSL %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATS_Sum )
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(RATS_Sum, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight)")

Here we have drawn boxplots of the summary measure (mean weight) for the three groups. We can see that the mean weights vary considerably between groups, with little overlap. The first group has the lightest rats while the third group has the heaviest. The group 2 distribution is the widest due to one outlier rat. Next, let's see how removing this outlier from group 2 changes the boxplot of summary means.

# filtering the outlier from Group 2
RATS_Sum <- filter(RATS_Sum, mean<550)
library(ggplot2)
ggplot(RATS_Sum, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight)")

The summary measure varies less in group 2 after removing the outlier, making the boxplots more equal in width across groups. The ascending order of mean weights from group 1 to group 3 remains. To confirm that the visually discernible difference is real, let's perform an analysis of variance (ANOVA) as a formal test. In the ANOVA output below the p-value (3.387e-12) indicates a statistically significant difference in the group means. The F value is far from 1, indicating that the variation between group means is large relative to the residual variation.

library(dplyr)
library(tidyr)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ Group, data = RATS_Sum)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 206390  103195   483.6 3.387e-12 ***
## Residuals 12   2561     213                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
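The F value in the table is simply the ratio of the two mean squares; using the rounded values printed above we can reproduce it approximately.

```r
# F = (between-group mean square) / (residual mean square)
F_val <- 103195 / 213
F_val  # ~484, close to the reported 483.6 (the table values are rounded)
```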

Linear mixed effects models

This part of the assignment uses the BPRS data. Here 40 male subjects were randomly assigned to one of two treatment groups, and each subject was rated on the brief psychiatric rating scale (BPRS) at baseline (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptoms, each rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.

First we will plot the data showing the change of bprs scores over time. The slightly overlapping confidence-interval shading makes the graph a bit hazy, but we can observe a slight decline in the bprs values over time.

library(dplyr); library(tidyr)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <-  pivot_longer(BPRS, cols=-c(treatment,subject),names_to = "weeks",values_to = "bprs") %>% arrange(weeks)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
rm(BPRS)
#str(BPRSL)
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Linear regression model

Ignoring the fact that the repeated measurements were made on the same individuals, we first analyse the data with multiple linear regression. Looking at the summary of the regression model, where the Brief Psychiatric Rating Scale score is the response variable and week and treatment are the explanatory variables, we can see that week is negatively associated with the BPRS score (t value -8.995) and the association is statistically significant (p < 2e-16). It appears that over time the patients improve regardless of the intervention. Treatment 2 showed no statistically significant effect on the BPRS score. However, the model makes the unlikely assumption that the repeated measures of bprs scores are independent.

BPRS_reg <- lm(bprs ~ week + treatment, data=BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
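The t values in the coefficient table are simply the estimates divided by their standard errors; for example, for week:

```r
# t value = Estimate / Std. Error
t_week <- -2.2704 / 0.2524
round(t_week, 3)  # -8.995, matching the summary above
```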

Random Intercept Model

Because the observations of the bprs scores are not independent of one another, we need additional models. First we create a random intercept model, where (1 | subject) denotes the random-effects term. The summary of the random intercept model (BPRS_ref) shows findings similar to the previous multiple linear model: time appears to be negatively associated with the bprs score. Comparing the standard error of time (week) in the two models, 0.2524 for the linear model and 0.2084 for the random intercept model, it is slightly larger in the former. This is because wrongly assuming independence of the repeated measures leads to a larger standard error for a within-subject covariate.

library(dplyr); library(tidyr)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <-  pivot_longer(BPRS, cols=-c(treatment,subject),names_to = "weeks",values_to = "bprs") %>% arrange(weeks)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))

#create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Random Intercept and Random Slope Model

Here we fit a random intercept and random slope model, letting both the intercept and the slope of week vary by subject. The association of week with the bprs score is similar to that in the previous models (t value -7.433). The likelihood ratio test gives a chi-squared statistic of 7.2721 and a small p-value of 0.02636. Therefore the random intercept and random slope model appears to provide a superior fit compared to the random intercept model. The outputs are given below.

# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = TRUE)
summary(BPRS_ref1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
## REML criterion at convergence: 2727.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8836 -0.6117 -0.0651  0.5423  3.7925 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 69.144   8.315         
##           week         1.052   1.026    -0.52
##  Residual             97.736   9.886         
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1568  21.538
## week         -2.2704     0.3055  -7.433
## treatment2    0.5722     1.0421   0.549
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.583       
## treatment2 -0.242  0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## refitting model(s) with ML (instead of REML)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
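The likelihood ratio test statistic in the table above can be reproduced by hand from the ML deviances and parameter counts (a small arithmetic sketch using the values printed above).

```r
# Chi-squared = difference in ML deviances; df = difference in parameter counts
chisq <- 2738.7 - 2731.4   # ~7.3 (7.2721 before rounding)
df    <- 7 - 5             # 2 extra parameters: slope variance + intercept-slope correlation
pchisq(7.2721, df = 2, lower.tail = FALSE)  # ~0.0264, matching Pr(>Chisq)
```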

Comparison of Random Intercept and Slope Models with interaction

Finally, let's add the interaction of week and treatment to the model, creating a random intercept and random slope model with interaction. Comparing the two models, the random intercept and random slope model with interaction (BPRS_ref2) does not provide a better fit than the corresponding model without the interaction (BPRS_ref1). This is evidenced by the small chi-squared value of 3.1712 and the p-value of 0.07495. The outputs of this analysis are shown below.


# create a random intercept and random slope model with the interaction
library(lme4)
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + (week*treatment), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + (week * treatment)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## refitting model(s) with ML (instead of REML)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + (week * treatment)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plotting the observed and fitted values

Here we plot the observed bprs scores and the fitted values from the interaction model. The descending slope is more apparent in the fitted plot.

# draw the plot of BPRSL with the observed bprs score values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_smooth() +
  scale_x_continuous(name = "weeks", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)

library(dplyr)
library(tidyr)
# Add the fitted values as a new column to BPRSL
BPRSL <- BPRSL %>%
  mutate(Fitted = Fitted)

# draw the plot of BPRSL with the fitted bprs values
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
  geom_smooth() +
  scale_x_continuous(name = "weeks", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Fitted bprs") +
  theme(legend.position = "top")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Thanks for the peer review

Merry Christmas